telemetry: add Agent TLS Certificate expiration metric #10768
Conversation
Force-pushed from 2318182 to 9420506:

1. Do not emit the metric if Query fails.
2. Properly check for PrimaryUsersIntermediate; the logic was inverted.

Also improve the logging by including the metric name in the log message.
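As a rough sketch of what the first fix implies (the type and field names here are hypothetical stand-ins, not the PR's actual code), the loop should skip the gauge entirely when the query errors rather than emit a misleading value:

```go
package telemetry

import (
	"context"
	"strings"
	"time"

	"github.com/armon/go-metrics"
	"github.com/hashicorp/go-hclog"
)

// certExpiryMonitor is a hypothetical stand-in for the monitor added in this PR.
type certExpiryMonitor struct {
	Key    []string                      // metric name, e.g. {"agent", "tls", "cert", "expiry"}
	Query  func() (time.Duration, error) // returns time until the certificate expires
	Logger hclog.Logger
}

func (m *certExpiryMonitor) monitor(ctx context.Context, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	// Include the metric name in log messages, per the logging improvement.
	logger := m.Logger.With("metric", strings.Join(m.Key, "."))

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			d, err := m.Query()
			if err != nil {
				// Fix 1: skip emission when the query fails, instead of
				// reporting a zero that looks like an expired certificate.
				logger.Warn("failed to emit certificate expiry metric", "error", err)
				continue
			}
			metrics.SetGauge(m.Key, float32(d.Seconds()))
		}
	}
}
```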
The code here looks good and correct, but I have a few meta points of feedback for potential future enhancements.
First, metrics that are conditionally emitted only by the leader are harder to deal with than metrics emitted by all servers, because stale values emitted by the previous leader hang around for a while. It might be a worthwhile change to have all servers emit the Connect CA expiry metrics if possible.
Second, it could be useful to have certificate updates trigger immediate metrics emission. I am thinking of a scenario where the agent TLS certificate is being monitored and it gets below some threshold. Someone gets paged, fixes the certificate, and reloads Consul. However, it may take another hour until the metric goes back to normal, and during that time the monitor might keep firing. It would be a minor quality-of-life improvement to update the metrics when the cert is updated, allowing any monitors to auto-resolve.
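A minimal sketch of that second idea, reusing the hypothetical `certExpiryMonitor` type from the sketch above and assuming the monitor could be handed a notification channel that fires on certificate reload (no such channel exists in this PR):

```go
func (m *certExpiryMonitor) monitorWithReload(ctx context.Context, interval time.Duration, certUpdated <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	emit := func() {
		d, err := m.Query()
		if err != nil {
			m.Logger.Warn("failed to emit certificate expiry metric", "error", err)
			return
		}
		metrics.SetGauge(m.Key, float32(d.Seconds()))
	}

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			emit()
		case <-certUpdated:
			// A reloaded certificate refreshes the gauge right away, so a
			// firing monitor can auto-resolve instead of waiting for the
			// next hourly tick.
			emit()
		}
	}
}
```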
```go
ticker := time.NewTicker(certExpirationMonitorInterval)
defer ticker.Stop()

logger := m.Logger.With("metric", strings.Join(m.Key, "."))

for {
	select {
	case <-ctx.Done():
		return nil
	case <-ticker.C:
```
Will this mean that the metric/logs do not get emitted until the monitoring interval has passed (so after an hour)? Or does the ticker fire immediately and then again after each interval?
It does not fire immediately. I think that would be a good improvement to handle the scenario you mentioned.
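For reference, Go's `time.Ticker` only delivers its first tick after one full interval has elapsed, so an immediate first emission has to be explicit. One common pattern (a sketch of the idea, not the code from #10771):

```go
func runMonitor(ctx context.Context, interval time.Duration, emit func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	emit() // fire once up front; the ticker's first tick only arrives after a full interval

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			emit()
		}
	}
}
```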
Added that in #10771.
I haven't run into this problem with datadog/statsd. Is it mostly an issue with Prometheus? Would it be sufficient to have the non-leaders emit NaN?
Going to address follow-ups in #10771, pending discussion about emitting NaN vs. the metric on all servers.
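For context on the NaN option: with armon/go-metrics (the library behind Consul's telemetry), every server could publish the gauge, with non-leaders emitting an explicit not-a-number so monitors can tell "not the leader" apart from a stale value left by a previous leader. A sketch, with a hypothetical metric name and leadership check:

```go
import (
	"math"

	"github.com/armon/go-metrics"
)

func emitCAExpiry(isLeader bool, secondsUntilExpiry float64) {
	key := []string{"mesh", "active-root-ca", "expiry"} // hypothetical name

	if isLeader {
		metrics.SetGauge(key, float32(secondsUntilExpiry))
		return
	}

	// Non-leaders emit NaN: the metric is always present on every server,
	// but its value marks which server owns the real reading.
	metrics.SetGauge(key, float32(math.NaN()))
}
```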
🍒 If backport labels were added before merging, cherry-picking will start automatically. To retroactively trigger a backport after merging, add backport labels and re-run https://circleci.com/gh/hashicorp/consul/422411.
This should be the last of the new metrics for #9891.
Adds a new metric with the number of seconds until the Agent TLS certificate expires, updated hourly.
Also fixes a couple of bugs in the existing metrics and makes one improvement to the error logging.
Tested using `socat -d - udp6-listen:8125` and `./consul agent -dev -hcl='telemetry {statsd_address = "localhost:8125"}'`, along with some agent config to set the CA and certificate for agent TLS.
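For anyone reproducing this, the agent-side config could look roughly like the following. `ca_file`, `cert_file`, `key_file`, and the `verify_*` flags are standard Consul agent options, but the exact snippet and paths are illustrative, not taken from the PR:

```hcl
telemetry {
  statsd_address = "localhost:8125"
}

# Agent TLS material; paths are placeholders.
ca_file   = "/etc/consul.d/ca.pem"
cert_file = "/etc/consul.d/agent.pem"
key_file  = "/etc/consul.d/agent-key.pem"

verify_incoming = true
verify_outgoing = true
```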